Get the Most out of Your Sample: Optimal Unbiased Estimators using Partial Information
Random sampling is an essential tool in the processing and transmission of
data. It is used to summarize data too large to store or manipulate and meet
resource constraints on bandwidth or battery power. Estimators that are applied
to the sample facilitate fast approximate processing of queries posed over the
original data and the value of the sample hinges on the quality of these
estimators.
Our work targets data sets such as request and traffic logs and sensor
measurements, where data is repeatedly collected over multiple {\em instances}:
time periods, locations, or snapshots.
We are interested in queries that span multiple instances, such as distinct
counts and distance measures over selected records. These queries are used for
applications ranging from planning to anomaly and change detection.
Unbiased low-variance estimators are particularly effective as the relative
error decreases with the number of selected record keys.
The Horvitz-Thompson estimator, known to minimize variance for sampling with
"all or nothing" outcomes (which reveal either the exact value or no information
on the estimated quantity), is not optimal for multi-instance operations for which an
outcome may provide partial information.
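
As a rough illustration of the "all or nothing" setting only, the sketch below applies a Horvitz-Thompson (inverse-probability) estimate to a sum under independent sampling of keys. The function names, the sampling rate `p`, and the toy data are assumptions made for this example and do not come from the paper; the paper's multi-instance estimators are not reproduced here.

```python
import random

def poisson_sample(values, p, rng):
    """Keep each key independently with probability p.

    Every key is either seen exactly or not seen at all
    (an "all or nothing" outcome).
    """
    return {k: v for k, v in values.items() if rng.random() < p}

def horvitz_thompson_sum(sample, p):
    """Inverse-probability estimate of the total.

    Each sampled value is scaled by 1/p; unsampled keys contribute 0,
    which makes the estimate unbiased for the true sum.
    """
    return sum(v / p for v in sample.values())

# Hypothetical toy data: 100 keys with values 1..100 (true sum 5050).
rng = random.Random(0)
values = {f"key{i}": float(i) for i in range(1, 101)}
p = 0.5

# Averaging many independent estimates should approach the true sum.
estimates = [
    horvitz_thompson_sum(poisson_sample(values, p, rng), p)
    for _ in range(2000)
]
mean_estimate = sum(estimates) / len(estimates)
```

A single estimate can be far from the true sum; only the average over repeated samples is expected to be close, which is what unbiasedness means here.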
We present a general principled methodology for the derivation of (Pareto)
optimal unbiased estimators over sampled instances and aim to understand its
potential. We demonstrate significant improvement in estimate accuracy of
fundamental queries for common sampling schemes.
Comment: This is a full version of a PODS 2011 paper
What you can do with Coordinated Samples
Sample coordination, where similar instances have similar samples, was
proposed by statisticians four decades ago as a way to maximize overlap in
repeated surveys. Coordinated sampling has since been used for summarizing
massive data sets.
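
One common way to obtain coordination, sketched here under assumptions, is to derive each key's sampling decision from a hash shared by all instances, so that a key present in several instances is included or excluded consistently. The helper names, the seed string, and the two toy instances below are hypothetical and are not taken from the paper.

```python
import hashlib

def key_rank(key, seed="shared-seed"):
    """Map a key to a pseudo-random number in [0, 1).

    The rank depends only on the key and the seed, so every instance
    that contains the key sees the same rank.
    """
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def coordinated_sample(instance, p):
    """Keep the keys whose shared rank falls below the sampling rate p."""
    return {k: v for k, v in instance.items() if key_rank(k) < p}

# Hypothetical instances: the same keys observed in two time periods.
monday = {f"user{i}": i for i in range(200)}
tuesday = {f"user{i}": i + 1 for i in range(100, 300)}

sample_monday = coordinated_sample(monday, 0.25)
sample_tuesday = coordinated_sample(tuesday, 0.25)

# Keys sampled in both instances; with a shared rank and equal rates,
# a common key is sampled in both instances or in neither.
overlap = set(sample_monday) & set(sample_tuesday)
```

With independent sampling at the same rate, a common key would appear in both samples only with probability p squared; sharing the rank is what raises the overlap.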
The usefulness of a sampling scheme hinges on the scope and accuracy within
which queries posed over the original data can be answered from the sample. We
aim here to gain a fundamental understanding of the limits and potential of
coordination. Our main result is a precise characterization, in terms of simple
properties of the estimated function, of queries for which estimators with
desirable properties exist. We consider unbiasedness, nonnegativity, finite
variance, and bounded estimates.
Since a single estimator generally cannot be optimal (minimize variance
simultaneously) for all data, we propose {\em variance competitiveness}, which
means that the expectation of the square of the estimate on any data is not too
far from the minimum possible for that data. Perhaps surprisingly, we show how to
construct, for any function for which an unbiased nonnegative estimator exists,
a variance competitive estimator.
Comment: 4 figures, 21 pages, Extended Abstract appeared in RANDOM 201